ABSTRACT

The well-known chi-square test is discussed for testing equiprobability
when the expected number of observations per cell is not large.
The  results  are  used  to  give  a justification  for  the  Poisson  Index  of 
Dispersion  test  of  fit  for  the  Poisson  distribution.  A chi-square  type 
statistic  is  studied  for  testing  equiprobability  and  Poisson-fit  when  the 
frequencies  associated  with  zero  are  missing  and  thus  the  sample  size  is 
unknown.  Some  applications  are  discussed  in  detail  where  the  samples 
are  incomplete. 
1. Introduction

Consider $n$ independent replicates of an experiment with $N$
different outcomes. An old statistical problem is to test whether the
outcomes all have the same probability, $1/N$. In Section 2 we will make some
remarks about the distribution of the well-known chi-square statistic when
the expected number of observations per outcome, $n/N$, is not large.

In  Section  3 we  will  discuss  testing  fit  for  the  Poisson  distribution  using 
the  Poisson  Index  of  Dispersion,  i.e.  comparing  the  sample  mean  and 
the  sample  variance.  A rigorous  proof  of  its  asymptotic  distribution  is 
given  using  the  results  of  Section  2.  In  some  applications  complete 
samples  are  not  obtained.  In  Section  4 we  will  consider  the  multinomial 
and  Poisson  case  when  the  frequencies  associated  with  zero  are  missing 
and  therefore  N is  unknown.  A limit  theorem  is  derived  for  a statistic 
of  chi-square  type,  which  can  be  used  for  testing  equiprobability  for  the 
multinomial  or  fit  for  the  Poisson  distribution.  In  the  last  section  some 
applications  are  discussed  in  detail  where  incomplete  samples  are  obtained. 

Some words about notation. A multinomial distribution with $n$
repetitions and $N$ equiprobable outcomes is denoted by Mult$(n, 1/N, \ldots, 1/N)$.
Po$(\lambda)$ stands for the Poisson distribution with mean $\lambda$, $N(m, \sigma^2)$
for the normal distribution with mean $m$ and variance $\sigma^2$, and $\chi^2(f)$
for the usual chi-square distribution with $f$ degrees of freedom. Convergence in
distribution is denoted by $\mathcal{L}(\cdots) \to \cdots$. To formulate limit
theorems properly we should use an extra index $\nu$, but to facilitate notation
we suppress it.


Let $(\xi_1, \ldots, \xi_N)$ be Mult$(n, 1/N, \ldots, 1/N)$. Consider Karl Pearson's
well-known chi-square statistic:

$$X^2_{n,N} = N \sum_{k=1}^{N} \xi_k^2 / n - n. \tag{2.1}$$

For fixed $N$ the following holds:

$$\mathcal{L}(X^2_{n,N}) \to \chi^2(N-1), \quad n \to \infty. \tag{2.2}$$

This result is due to R. A. Fisher, who corrected Pearson's mistake concerning
the degrees of freedom; see e.g. Rao [14] for a proof.

The chi-square test is sometimes used in such a way that it is
more logical to consider asymptotic results when both $n, N \to \infty$ in
such a manner that $n/N \to \lambda$, $0 < \lambda < \infty$. Using different
methods this case has been investigated by Harris and Park [7], Holst [8, 9] and
Morris [13], showing that

$$\mathcal{L}\big((X^2_{n,N} - (N-1))/(2(N-1))^{1/2}\big) \to N(0,1). \tag{2.3}$$

Note that the limit does not depend on $\lambda$. As we have

$$\mathcal{L}\big((\chi^2(f) - f)/(2f)^{1/2}\big) \to N(0,1), \quad f \to \infty, \tag{2.4}$$

it is reasonable to use the chi-square approximation even if $n/N$, the
expected number of observations per class, is small. This theoretical
argument is not a consequence of the usual chi-square theory as given
in e.g. [14]. By comparing moments of $X^2_{n,N}$ with those of $\chi^2(N-1)$,
see Rao and Chakravarti [15], one can expect the chi-square approximation
to be accurate, better than the normal indicated by the limit result
above, even if $n/N$ is quite small. This is also confirmed by
different numerical investigations, see Good, Gover and Mitchell [4],
Katti [11], and Zahn and Roberts [18].
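The two standardizations discussed above can be sketched numerically. The snippet below is only an illustration: the function names and the cell counts are hypothetical, and it assumes that (2.1) reads $X^2_{n,N} = N \sum \xi_k^2/n - n$ and that (2.3) centres at $N-1$ with variance $2(N-1)$.

```python
import math

def pearson_statistic(counts):
    """X^2 = N * sum(xi_k^2) / n - n for N equiprobable cells.

    `counts` must list every cell, including the empty ones.
    """
    n = sum(counts)
    N = len(counts)
    return N * sum(c * c for c in counts) / n - n

def classical_form(counts):
    """The same statistic written as sum((observed - expected)^2 / expected)."""
    n = sum(counts)
    N = len(counts)
    expected = n / N
    return sum((c - expected) ** 2 / expected for c in counts)

def standardized(x2, N):
    """(X^2 - (N - 1)) / sqrt(2 (N - 1)), to be compared with N(0, 1)."""
    f = N - 1
    return (x2 - f) / math.sqrt(2 * f)

# Hypothetical sparse data: n = 10 observations in N = 10 cells, so n/N = 1.
counts = [0, 2, 1, 0, 3, 1, 0, 1, 2, 0]
x2 = pearson_statistic(counts)
z = standardized(x2, len(counts))
print(x2, classical_form(counts), z)
```

The two forms of the statistic agree algebraically; the short form is the one reused later when the empty cells are not observed.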


3. The Poisson Index of Dispersion

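Let the random variables $\eta_1, \ldots, \eta_N$ be i.i.d. Po$(\lambda)$. The
Poisson Index of Dispersion is defined as

$$T_N = \sum_{k=1}^{N} (\eta_k - \bar\eta)^2 / \bar\eta = N \sum_{k=1}^{N} \eta_k^2 \Big/ \sum_{k=1}^{N} \eta_k - \sum_{k=1}^{N} \eta_k, \tag{3.1}$$

where $\bar\eta$ is the sample mean. It is a well-known fact that

$$(\eta_1, \ldots, \eta_N) \,\Big|\, \sum_{k=1}^{N} \eta_k = n \;\sim\; \mathrm{Mult}(n, 1/N, \ldots, 1/N). \tag{3.2}$$

Thus, using the notation of the previous section,

$$\mathcal{L}\Big(T_N \,\Big|\, \sum_{k=1}^{N} \eta_k = n\Big) = \mathcal{L}(X^2_{n,N}). \tag{3.3}$$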


We can use this to construct a test for the hypothesis $H_0$: $\eta_1, \ldots, \eta_N$
i.i.d. Po$(\lambda)$; consider $\sum_1^N \eta_k = n$ as given, reject $H_0$ if
$T_N > c$ = constant. This test is an old one; it was introduced by
R. A. Fisher in the first edition of "Statistical Methods for Research
Workers". It is discussed in many papers, see e.g. Haight [5, p. 94],
Kathirgamatamby [10], and Rao and Chakravarti [15]. Often it is claimed
that from the usual chi-square theory it follows that $c = \chi^2_\alpha(N-1)$, the
upper $\alpha$-percentile of chi-square, gives a test with approximate level $\alpha$
in large samples. This argument is not correct as $N$ is not regarded
as fixed. But, by the strong law of large numbers, with probability one,

$$\sum_{k=1}^{N} \eta_k / N = n/N \to \lambda \tag{3.4}$$

when the number of observations $N \to \infty$. Therefore we actually have
the situation discussed in Section 2. The results there indicate that
$c = \chi^2_\alpha(N-1)$ is a good approximation for large $N$. The significance
level of the unconditional test: reject $H_0$ if $T_N > \chi^2_\alpha(N-1)$, is
given by

$$\sum_{n=0}^{\infty} P\Big(T_N > \chi^2_\alpha(N-1) \,\Big|\, \sum_{1}^{N} \eta_k = n\Big) P\Big(\sum_{1}^{N} \eta_k = n\Big) \to \alpha, \tag{3.5}$$

when $N \to \infty$. Thus we have obtained a correct theoretical justification
for R. A. Fisher's old test based on the Poisson Index of Dispersion.
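A small numerical sketch of the Index of Dispersion may help here. It assumes that (3.1) has the two equivalent forms $\sum(\eta_k - \bar\eta)^2/\bar\eta$ and $N\sum\eta_k^2/\sum\eta_k - \sum\eta_k$; the function name and the observations are hypothetical.

```python
def index_of_dispersion(obs):
    """Return the Poisson Index of Dispersion computed in two equivalent ways."""
    N = len(obs)
    total = sum(obs)
    mean = total / N
    # Definition: sum of squared deviations divided by the sample mean.
    t_definition = sum((x - mean) ** 2 for x in obs) / mean
    # Computational form, which coincides with Pearson's X^2 given sum(obs) = n.
    t_shortcut = N * sum(x * x for x in obs) / total - total
    return t_definition, t_shortcut

# Hypothetical counts: N = 8 observations with total n = 10.
obs = [2, 0, 1, 3, 1, 0, 2, 1]
t_def, t_short = index_of_dispersion(obs)
print(t_def, t_short, len(obs) - 1)  # statistic (both forms) and degrees of freedom
```

The value would then be compared with the upper $\alpha$-percentile of chi-square with $N - 1$ degrees of freedom, taken from a table or a statistics library.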


4. Empty classes missing


Consider the same model as in Section 2, i.e. $(\xi_1, \ldots, \xi_N)$
is Mult$(n, 1/N, \ldots, 1/N)$. Here we suppose that the $\xi$'s equal to 0
are not observed, $N$ is unknown and the ordering of the $\xi$'s is
imagined. We want to test the model. This type of problem is of
practical  importance  as  we  will  see  from  the  examples  in  the  next  section. 

The limit result of Section 2, when $n/N \to \lambda$, can be stated as

$$\mathcal{L}\Big(\Big(N \sum_{1}^{N} \xi_k^2 / n - n - N\Big) \Big/ (2N)^{1/2}\Big) \to N(0,1). \tag{4.1}$$

We  could  contemplate  using  an  estimate  of  N in  this  formula. 

Essentially the maximum likelihood estimator $(n/N)^*$ of $n/N$ is given by

$$(1 - \exp(-(n/N)^*))/(n/N)^* = \sum_{1}^{N} I(\xi_k \neq 0)/n, \tag{4.2}$$

see Darroch [3], Harris [6], Lewontin and Prout [12], Samuel [16], or
Seber [17, p. 136]. Table A.3 in [17] is useful for solving the equation.
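Where the table is not at hand, equation (4.2) can also be solved numerically. The following is a minimal sketch using bisection; the function names `g` and `solve_lambda` are hypothetical, and the ratio 341/435 is the one arising in Example 1 of Section 5.

```python
import math

def g(lam):
    """Left-hand side of (4.2): (1 - exp(-lam)) / lam, decreasing in lam > 0."""
    return -math.expm1(-lam) / lam

def solve_lambda(ratio, lo=1e-9, hi=50.0, tol=1e-12):
    """Solve g(lam) = ratio by bisection on [lo, hi]."""
    if not g(hi) < ratio < g(lo):
        raise ValueError("ratio must lie strictly between g(hi) and g(lo)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > ratio:
            lo = mid   # g is decreasing, so the root lies to the right of mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Ratio (number of distinct classes observed) / n from Example 1.
lam_star = solve_lambda(341 / 435)
print(round(lam_star, 4))
```

Bisection is chosen only because the left-hand side is monotone in the unknown; any standard root finder would do equally well.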

The  following  limit  theorem  holds. 

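Theorem. If

$$(\xi_1, \ldots, \xi_N) \sim \mathrm{Mult}(n, 1/N, \ldots, 1/N), \tag{4.3}$$

$$\lambda_N = n/N \to \lambda, \quad 0 < \lambda < \infty, \quad n, N \to \infty, \tag{4.4}$$

$$(1 - e^{-\lambda^*})/\lambda^* = \sum_{1}^{N} I(\xi_k \neq 0)/n, \tag{4.5}$$

then, when $n, N \to \infty$,

$$\mathcal{L}(Z_N) \to N(0,1), \tag{4.6}$$

where

$$Z_N = \Big(\sum_{1}^{N} \xi_k^2 - n - n\lambda^*\Big) \Big/ \Big(n\lambda^*\big(2 - \lambda^{*2}/(e^{\lambda^*} - 1 - \lambda^*)\big)\Big)^{1/2}. \tag{4.7}$$

Remark 1. Note that $\sum_1^N \xi_k^2$ and $n$ are observable quantities even
if $N$ and the 0's are unknown.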

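Proof. Using the equation (4.5) and the lemma below (cf. Remark 3)
and well-known convergence theorems, see e.g. Rao [14, section 6a.2],
we obtain that the random variable

$$Z_{1N} = \Big(\sum_{1}^{N} \xi_k^2 - n - n\lambda^*\Big) \Big/ (\lambda^* N^{1/2}) \tag{4.8}$$

has the same asymptotic distribution as

$$Z_{2N} = \big(e_{1N} + e_{2N}\, \lambda_N^2 e^{\lambda_N}/(e^{\lambda_N} - 1 - \lambda_N)\big)/\lambda_N, \tag{4.9}$$

where

$$e_{1N} = \sum_{1}^{N} (\xi_k^2 - \lambda_N - \lambda_N^2)/N^{1/2}, \tag{4.10}$$

$$e_{2N} = \sum_{1}^{N} \big(I(\xi_k \neq 0) - (1 - e^{-\lambda_N})\big)/N^{1/2}. \tag{4.11}$$

From the lemma we find

$$\mathcal{L}(Z_{2N}) \to N\big(0,\ 2 - \lambda^2/(e^{\lambda} - 1 - \lambda)\big). \tag{4.12}$$

As $\lambda^* \to \lambda$ in probability it follows that

$$\lambda^*/\big(2 - \lambda^{*2}/(e^{\lambda^*} - 1 - \lambda^*)\big) \to \lambda/\big(2 - \lambda^2/(e^{\lambda} - 1 - \lambda)\big), \tag{4.13}$$

in probability. Therefore

$$\mathcal{L}(Z_N) = \mathcal{L}\Big(Z_{1N} \cdot \big(\lambda^*/(\lambda_N(2 - \lambda^{*2}/(e^{\lambda^*} - 1 - \lambda^*)))\big)^{1/2}\Big) \to N(0,1), \tag{4.14}$$

which proves the assertion. ■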

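Lemma. Let $s$, $t$ be real numbers. If $\lambda_N = n/N \to \lambda$, $0 < \lambda < \infty$, then

$$\mathcal{L}\Big(s \sum_{1}^{N} (\xi_k^2 - \lambda_N - \lambda_N^2)/N^{1/2} + t \sum_{1}^{N} \big(I(\xi_k \neq 0) - (1 - e^{-\lambda_N})\big)/N^{1/2}\Big)$$
$$\to N\big(0,\ 2s^2\lambda^2 - 2st\lambda^2 e^{-\lambda} + t^2 e^{-2\lambda}(e^{\lambda} - 1 - \lambda)\big). \tag{4.15}$$

Proof. The assertion is a special case of general theorems on multinomial
sums, see Harris [7], Holst [8, 9].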

Remark  2.  Taking  t = 0,  s = 1 gives  the  limit  result  of  Section  2. 

Remark 3. Taking $s = 0$, $t = 1$ gives

$$\mathcal{L}\Big(\sum_{1}^{N} \big(I(\xi_k \neq 0) - (1 - e^{-\lambda_N})\big)/N^{1/2}\Big) \to N\big(0,\ e^{-2\lambda}(e^{\lambda} - 1 - \lambda)\big), \tag{4.16}$$

a limit theorem for the classical occupancy problem.
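The variance in (4.12) can be checked from the Lemma by a particular choice of $s$ and $t$. The short derivation below is a sketch under two assumptions: that the limiting variance in (4.15) reads $2s^2\lambda^2 - 2st\lambda^2 e^{-\lambda} + t^2 e^{-2\lambda}(e^{\lambda} - 1 - \lambda)$, and that $Z_{2N}$ in (4.9) is the combination of $e_{1N}$ and $e_{2N}$ with weights $1/\lambda_N$ and $\lambda_N e^{\lambda_N}/(e^{\lambda_N} - 1 - \lambda_N)$. The abbreviation $D$ is introduced here only for convenience.

```latex
\begin{aligned}
s &= 1/\lambda, \qquad t = \lambda e^{\lambda}/D, \qquad D = e^{\lambda} - 1 - \lambda, \\
2 s^2 \lambda^2 &= 2, \\
2 s t \lambda^2 e^{-\lambda} &= 2\lambda^2 / D, \\
t^2 e^{-2\lambda} D &= \lambda^2 / D, \\
2 s^2 \lambda^2 - 2 s t \lambda^2 e^{-\lambda} + t^2 e^{-2\lambda} D
  &= 2 - 2\lambda^2/D + \lambda^2/D
   = 2 - \lambda^2/(e^{\lambda} - 1 - \lambda).
\end{aligned}
```

Under these readings the three terms combine to exactly the variance stated in (4.12).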

As in Section 3 we can study $\eta_1, \ldots, \eta_N$, i.i.d. Po$(\lambda)$ random variables.
Suppose that only non-zero $\eta$'s are observed and $N$ is unknown. To construct
a test for the Poisson model we can consider the conditional distribution

$$(\eta_1, \ldots, \eta_N) \,\Big|\, \sum_{1}^{N} \eta_k = n \;\sim\; (\xi_1, \ldots, \xi_N) \sim \mathrm{Mult}(n, 1/N, \ldots, 1/N), \tag{4.17}$$

and use the result above (by the law of large numbers $\sum_1^N \eta_k/N \to \lambda$).
5. Applications where zeros are missing

We  will  discuss  in  detail  three  examples  from  completely  different 
areas  of  application  to  indicate  that  the  problems  considered  in  Section  4 
are  of  practical  interest. 

Example 1 (Biology). In Craig [1] a method of estimating the size of
a butterfly population is considered, see also [17, p. 137]. The butterflies
were caught one at a time and released after marking. A total number of
341 different butterflies were caught, of which 258 occurred once, 72
twice and 11 three times in the sample. We can imagine that the butterflies
are numbered $1, 2, \ldots, N$, and we want to estimate $N$. Let $\xi_k$
denote the number of times butterfly $k$ has been caught. A possible
model is that $(\xi_1, \ldots, \xi_N) \sim$ Mult$(n, 1/N, \ldots, 1/N)$. For the data above
we have

$$n = \sum_{1}^{N} \xi_k = 258 \cdot 1 + 72 \cdot 2 + 11 \cdot 3 = 435,$$

$$\sum_{1}^{N} I(\xi_k \neq 0) = 258 + 72 + 11 = 341, \tag{5.1}$$

$$\sum_{1}^{N} \xi_k^2 = 258 \cdot 1 + 72 \cdot 4 + 11 \cdot 9 = 645.$$

To test the model the theorem in Section 4 can be used in the following
way. First

$$(1 - e^{-\lambda^*})/\lambda^* = 341/435, \tag{5.2}$$

and from Table A.3 in [17] we find $\lambda^* = 0.5084$. After some computing
we obtain $Z_N = -1.32$, which can be compared with a $N(0,1)$-percentile.
As a big value indicates departure from assumptions we have no reason
to reject the model. It is also possible to estimate the expected number
of butterflies caught 1, 2, 3, 4 times. One finds 261.7, 66.5, 11.3, 1.4,
close to the observed values. In [1] and [17, p. 157] it is suggested
that these numbers can be used in a chi-square test. To make a
rigorous mathematical justification for this is not an easy matter. Besides
there is always the arbitrariness about pooling classes. This problem
does not occur for the test above. In [17, p. 14], it is claimed that
the Poisson Index of Dispersion test is more sensitive than the usual
chi-square. One may conjecture that this also holds true for the test
above compared with chi-square. This matter needs further investigation.
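The computation in Example 1 can be retraced with a few lines of code. This is a sketch under stated assumptions: the test statistic is taken as $Z_N = (\sum \xi_k^2 - n - n\lambda^*)/(n\lambda^*(2 - \lambda^{*2}/(e^{\lambda^*} - 1 - \lambda^*)))^{1/2}$, which reproduces the quoted value, and the expected number of classes observed $x$ times is taken as $(n/\lambda^*) e^{-\lambda^*} \lambda^{*x}/x!$. The function names are hypothetical.

```python
import math

def z_statistic(n, sum_sq, lam):
    """Test statistic Z_N computed from n, sum(xi_k^2) and the estimate lam."""
    d = math.exp(lam) - 1.0 - lam
    variance = n * lam * (2.0 - lam * lam / d)
    return (sum_sq - n - n * lam) / math.sqrt(variance)

def expected_counts(n, lam, max_x):
    """Estimated expected number of classes observed x = 1, ..., max_x times."""
    n_classes = n / lam                      # estimate of N
    return [n_classes * math.exp(-lam) * lam ** x / math.factorial(x)
            for x in range(1, max_x + 1)]

# Butterfly data: 258 caught once, 72 twice, 11 three times.
n = 258 * 1 + 72 * 2 + 11 * 3        # 435
sum_sq = 258 * 1 + 72 * 4 + 11 * 9   # 645
lam = 0.5084                         # value quoted from Table A.3 in [17]

z = z_statistic(n, sum_sq, lam)
expected = expected_counts(n, lam, 4)
print(round(z, 2), [round(e, 1) for e in expected])
```

The expected counts obtained this way agree with the quoted ones up to rounding in the first decimal.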

Example 2 (Medicine). In an example given in Dahiya and Gross [2],
referring to an epidemic of cholera in a village of India, the incomplete
Poisson model of Section 4 was used. With $f_x = \#\{k;\ \eta_k = x\}$ we
have $f_1 = 32$, $f_2 = 16$, $f_3 = 6$, $f_4 = 1$ and $f_x = 0$ for $x \geq 5$. Computing
as in Example 1 we find $\lambda^* = 0.9722$ and $Z_N = -0.51$, indicating good
fit. The missing number $f_0$ is estimated to be $33 = [33.46]$. From
Remark 3 in Section 4 we can easily construct confidence intervals for $N$
and therefore for $f_0$. For a 95% confidence interval we just solve

$$(1 - e^{-\lambda})/\lambda = \sum_{1}^{N} I(\xi_k \neq 0)/n \pm 1.96 \cdot \big((e^{\lambda^*} - 1 - \lambda^*)/(e^{2\lambda^*} n \lambda^*)\big)^{1/2}. \tag{5.3}$$

We obtain $\lambda_l = 0.742$, $\lambda_u = 1.240$ and the confidence limits for $N$
are $N_u = 86/0.742 = 116$ and $N_l = 69$. The 95% confidence interval
for $f_0$ is $(69 - 55, 116 - 55) = (14, 61)$, slightly different from that given
in [2]. As the distribution of $\sum_1^N I(\xi_k \neq 0)$ is probably better approximated
by the normal than that of $N^*$, the method used here may be more accurate
than using the normal approximation of $N^*$ as in [2].
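The confidence limits of Example 2 can be reproduced numerically. The sketch below assumes that the half-width of the interval for $\sum I(\xi_k \neq 0)/n$ is $1.96\,((e^{\lambda^*} - 1 - \lambda^*)/(e^{2\lambda^*} n \lambda^*))^{1/2}$, as suggested by Remark 3, and solves the resulting equations by bisection. All function names are hypothetical.

```python
import math

def g(lam):
    """(1 - exp(-lam)) / lam, decreasing in lam > 0."""
    return -math.expm1(-lam) / lam

def solve_lambda(ratio, lo=1e-9, hi=50.0, tol=1e-12):
    """Solve g(lam) = ratio by bisection."""
    if not g(hi) < ratio < g(lo):
        raise ValueError("ratio outside the range covered by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def confidence_limits(n, distinct, z=1.96):
    """Return (lam_lower, lam_upper, N_lower, N_upper) for the unknown N."""
    ratio = distinct / n
    lam = solve_lambda(ratio)
    half = z * math.sqrt((math.exp(lam) - 1.0 - lam)
                         / (math.exp(2.0 * lam) * n * lam))
    lam_lower = solve_lambda(ratio + half)   # larger ratio gives smaller lambda
    lam_upper = solve_lambda(ratio - half)
    return lam_lower, lam_upper, n / lam_upper, n / lam_lower

# Cholera data: f_1 = 32, f_2 = 16, f_3 = 6, f_4 = 1.
freq = {1: 32, 2: 16, 3: 6, 4: 1}
n = sum(x * f for x, f in freq.items())   # 86
distinct = sum(freq.values())             # 55
lam_l, lam_u, N_l, N_u = confidence_limits(n, distinct)
print(round(lam_l, 3), round(lam_u, 3), round(N_l, 1), round(N_u, 1))
```

Subtracting the 55 observed classes from the limits for $N$ gives the corresponding interval for the number of unobserved classes.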

Example 3 (History and Mintage). In a project on South Asian Monetary
History at the University of Wisconsin-Madison the "Qunduz Hoard" is
studied. This is a hoard of silver coins of the Indo-Greek kings of ancient
India (circa 200-100 B.C.). It derives its name, the Qunduz hoard,
from its find spot in Afghanistan and is especially significant as it is
one of the largest intact samples of the coinage of the Indo-Greeks available
for study, and has been published with excellent photographic plates
enabling the researcher to detect die differences accurately. The coins
were produced using a certain tool, called a die, which was worn out
after a certain number of coins. This number is approximately known and
is different for the two sides of the coin, the so-called obverse and reverse
sides.  One  can  distinguish  between  coins  produced  by  different  dies. 
Knowing  the  number  of  dies  used  to  produce  a certain  coin  type,  the 
total  number  of  coins  of  this  type  produced  by  the  mint  could  be  estimated, 
giving  a measure  of  the  economic  activity  of  the  period.  The  problem  is 
therefore  to  estimate  the  number  of  dies. 
In a sample let $\xi_k$ denote the number of coins produced by die $k$.
If the coins in the "Qunduz Hoard" can be considered as a random sample
from the population of coins used in those days, the dies were used
about the same number of times (much larger than the sample sizes) and
the classification of the coins is accurate, then $(\xi_1, \ldots, \xi_N)$
$\sim$ Mult$(n, 1/N, \ldots, 1/N)$ would be an adequate probabilistic model.
Here $N$ is the unknown number of dies, the zeros are not observed and
the labeling is imagined. Using the method described above we can
estimate $N$. The underlying assumptions implying equiprobability are
crucial, so the model should be tested for fit. This can be done as in
the  other  examples.  The  Poisson  model  would  also  be  reasonable  in  this 
example.  The  same  statistical  analysis  is  used  in  either  model. 

As an illustration let us consider the following data from the "Qunduz
Hoard". Using the notation of Example 2 we have for "obverse" sides of
"Heliocles" coins: $f_1 = 102$, $f_2 = 26$, $f_3 = 8$, $f_4 = 2$, $f_5 = f_6 = f_7 = 1$,
$f_x = 0$ for $x \geq 8$. We find $\lambda^* = 0.7907$ and using this $N^* = 258$.
But $Z_N = 6.15$, a highly significant value indicating bad fit. Estimates
of expected values are 92.5, 36.5, 9.6, 1.9, 0.3, 0.04, 0.004, showing
that $f_6 = f_7 = 1$ are unlikely in a random sample. Adopting the standard
rule for chi-square, that the expected number of observations in a class
should be at least 5, we must pool classes 3-7, giving a chi-square of
4.17 with 3 - 1 - 1 = 1 degree of freedom. As $\chi^2_{0.05}(1) = 3.84 < 4.17
< \chi^2_{0.01}(1) = 6.63$ there is significant deviation but just on the 5% level;
cf. the conjecture at the end of Example 1. As the underlying model is
very much in doubt one can expect the estimate of $N$ to be biased. For
the reverse side of the same coins we have: $f_1 = 156$, $f_2 = 19$, $f_3 = 2$,
$f_4 = 1$, $f_x = 0$ for $x \geq 5$. From the same statistical analysis we get
$\lambda^* = 0.2792$, $N^* = 731$, $Z_N = 1.57$ and expected values 154.4, 21.6,
2.0, 0.14. So the fit is adequate. A 95% confidence interval for $N$
is  (536,  1107).  We  can  remark  that  the  difference  of  fit  between  the 
"obverse"  and  "reverse"  sides  can  be  explained.  A plausible  explanation 
is,  that  the  number  of  coins  produced  by  the  same  die  varies  much  more 
for  the  obverse  side,  because  it  was  much  more  difficult  to  make  an 
obverse  die  than  a reverse. 
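The contrast between the two sides of the coins can be retraced directly from the frequency tables. The sketch below assumes the same reading of $Z_N$ as in the theorem of Section 4, namely $(\sum \xi_k^2 - n - n\lambda^*)/(n\lambda^*(2 - \lambda^{*2}/(e^{\lambda^*} - 1 - \lambda^*)))^{1/2}$, and estimates $N$ by $n/\lambda^*$. The function names and the dictionary layout are hypothetical.

```python
import math

def g(lam):
    """(1 - exp(-lam)) / lam, decreasing in lam > 0."""
    return -math.expm1(-lam) / lam

def solve_lambda(ratio, lo=1e-9, hi=50.0, tol=1e-12):
    """Solve g(lam) = ratio by bisection."""
    if not g(hi) < ratio < g(lo):
        raise ValueError("ratio outside the range covered by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def analyse(freq):
    """freq maps x >= 1 to f_x, the number of classes observed x times."""
    n = sum(x * f for x, f in freq.items())
    distinct = sum(freq.values())
    sum_sq = sum(x * x * f for x, f in freq.items())
    lam = solve_lambda(distinct / n)
    d = math.exp(lam) - 1.0 - lam
    z = (sum_sq - n - n * lam) / math.sqrt(n * lam * (2.0 - lam * lam / d))
    return {"n": n, "distinct": distinct, "lambda": lam, "N": n / lam, "Z": z}

obverse = {1: 102, 2: 26, 3: 8, 4: 2, 5: 1, 6: 1, 7: 1}
reverse = {1: 156, 2: 19, 3: 2, 4: 1}
res_obv = analyse(obverse)
res_rev = analyse(reverse)
for name, res in (("obverse", res_obv), ("reverse", res_rev)):
    print(name, round(res["lambda"], 4), round(res["N"]), round(res["Z"], 2))
```

Small differences in the last digit of $\lambda^*$ relative to the quoted values can arise because the text reads the estimate from a table, whereas the sketch solves the equation to machine precision.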

I would like to thank Mr. John Deyell, Department of South Asian
History, and Mr. Richard Bittman, Department of Mathematics and the
Statistical  Laboratory,  University  of  Wisconsin-Madison,  for  providing  me 
with  this  example  and  for  discussions  on  it.  The  inspiration  for  this  paper 
arose  largely  in  connection  with  this  application. 